Sociology 229: Advanced Regression
Assignment #3: EHA Basics
Due: Start of class, February 2
This assignment uses a dataset posted on the course website, “Assignment 3 GSS2006subset.dta”, and an accompanying do-file.
- Download the Stata dataset.
- Create your own “do” file that opens the data.
- My syntax creates some variables and makes survivor, hazard, and integrated hazard plots. See if you can get that same syntax to run on your computer without error. Make your own do-file; don’t just use mine!
  - Note: Don’t worry if you don’t understand the “stset” command. We’ll discuss that later.
  - Note 2: I’ve created a dummy variable that identifies people born prior to 1960. (I suspected that their timing of first childbirth might differ from that of people born more recently.) I was then able to make plots that break out groups based on the values of that dummy variable.
  - I created several other variables that may be useful.
  - Note: ‘parentsincome’ refers to family income at age 16.
- Examine survivor, hazard, and integrated hazard plots for two or more other interesting subgroups in the data. You can either create your own new dummy variable or use one that I created (such as gender or having “rich parents”).
- Run a basic Cox regression model looking at the effects of gender, race (with white as the omitted category), parents’ income, and mother’s education on the hazard rate of having a first child.
- Answer the questions below.
Question 1: Write a few sentences describing the survivor, hazard, and integrated hazard plots. What do they tell you about the timing of childbirth in the US? What is the overall shape? When is the rate highest? About what proportion never have a first child? (4-5 sentences are sufficient, but you can write more if you wish.)
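If it helps to see how the three curves relate before reading the plots, here is a minimal sketch in Python (not Stata, and not part of what you turn in). It assumes the Kaplan-Meier estimator for the survivor function and the Nelson-Aalen estimator for the integrated hazard, and it uses made-up ages, not the GSS data. The function and variable names are hypothetical.

```python
from collections import Counter


def survivor_and_integrated_hazard(times, events):
    """Return (time, survivor, integrated hazard) at each observed event time.

    times  : age at first birth, or age at the survey if no birth was observed
    events : 1 if a first birth was observed at that age, 0 if censored
    """
    births = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)          # everyone leaves the risk set at their time
    at_risk = len(times)
    survivor, integrated_hazard = 1.0, 0.0
    rows = []
    for t in sorted(exits):
        d = births.get(t, 0)
        if d > 0:
            step = d / at_risk      # estimated hazard at age t
            survivor *= 1 - step    # Kaplan-Meier product
            integrated_hazard += step  # Nelson-Aalen running sum
            rows.append((t, survivor, integrated_hazard))
        at_risk -= exits[t]         # drop births and censored cases alike
    return rows


# Made-up toy data: eight respondents, three of them censored (no birth yet).
ages = [19, 22, 22, 25, 28, 30, 35, 40]
had_child = [1, 1, 1, 1, 0, 1, 0, 0]

rows = survivor_and_integrated_hazard(ages, had_child)
for age, s, h in rows:
    print(f"age {age}: survivor = {s:.3f}, integrated hazard = {h:.3f}")
```

In this toy example the survivor column falls at each birth and the integrated hazard rises by the same hazard step. The last survivor value is the estimated share still childless at the oldest age where a birth was observed.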
Question 2: How does the timing of childbirth differ for people born before 1960 versus those born in 1960 or later?
Question 3: What categorical variable did you create? Why did you expect those groups to differ in the timing of childbirth? What did you observe in your plots? Was it what you expected?
Question 4: Summarize the findings of the Cox model. Interpret the coefficient for “dfemale” by exponentiating it to determine the impact of gender on the hazard rate of having a first child.
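As a reminder of the arithmetic, here is a minimal sketch in Python (again, not Stata) of turning a Cox coefficient into a hazard ratio and a percent change in the hazard. The coefficient value 0.25 is purely hypothetical and is not the estimate for “dfemale”; the function names are also just illustrative.

```python
import math


def hazard_ratio(coef):
    """Exponentiate a Cox coefficient to get the hazard ratio."""
    return math.exp(coef)


def percent_change_in_hazard(coef):
    """Percent difference in the hazard for a one-unit increase in the variable."""
    return 100 * (math.exp(coef) - 1)


example_coef = 0.25  # hypothetical value, NOT taken from the assignment data

print(f"hazard ratio: {hazard_ratio(example_coef):.3f}")
print(f"percent change in hazard: {percent_change_in_hazard(example_coef):.1f}%")
```

A coefficient of zero gives a hazard ratio of 1 (no difference), positive coefficients give ratios above 1, and negative coefficients give ratios below 1. If your Stata output already reports hazard ratios rather than coefficients, the exponentiation has been done for you.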
Question 5: Notice that I did not include a measure of the respondent’s income (or education, or anything else that changes over time). That is because the simple structure of this dataset does not allow independent variables to change over time. We know an individual’s income only at the time of the survey, which could be long after they had their first child. Why might that cause problems for this analysis? Or, more specifically, how might including such a variable bias the results?
Turn in the following:
- A hazard plot of first childbirth, broken out by the variable of your choice (Step 4)
- Results of the Cox model
- Answers to the questions